Automatic Detection of Cognates Using Orthographic Alignment
نویسندگان
چکیده
Words undergo various changes when entering new languages. Based on the assumption that these linguistic changes follow certain rules, we propose a method for automatically detecting pairs of cognates employing an orthographic alignment method which proved relevant for sequence alignment in computational biology. We use aligned subsequences as features for machine learning algorithms in order to infer rules for linguistic changes undergone by words when entering new languages and to discriminate between cognates and non-cognates. Given a list of known cognates, our approach does not require any other linguistic information. However, it can be customized to integrate historical information regarding language evolution.
منابع مشابه
Automatic Detection of Orthographics Cues for Cognate Recognition
Present-day machine translation technologies crucially depend on the size and quality of lexical resources. Much of recent research in the area has been concerned with methods to build bilingual dictionaries automatically. In this paper we propose a methodology for the automatic detection of cognates between two languages based solely on the orthography of words. From a set of known cognates, t...
متن کاملA Dictionary-Based Approach for Evaluating Orthographic Methods in Cognates Identification
In this paper we propose a method for identifying cognates based on etymology and etymons. We employ this approach to evaluate the extent to which lexical similarity can be used for automatic detection of cognate pairs. We investigate some orthographic approaches widely used in this research area and some original metrics as well. We apply this procedure for Romanian and its most closely relate...
متن کاملCognates and Word Alignment in Bitexts
We evaluate several orthographic word similarity measures in the context of bitext word alignment. We investigate the relationship between the length of the words and the length of their longest common subsequence. We present an alternative to the longest common subsequence ratio (LCSR), a widely-used orthographic word similarity measure. Experiments involving identification of cognates in bite...
متن کاملString Similarity Measures and PAM-like Matrices for Cognate Identification
We present a new automatic learning system for the identification of cognates, words that derive from a common ancestor and share the same etymological origin. Our approach combines and adapts several techniques developed for biological sequence analysis to the natural language processing environment. We design a linguistic-inspired matrix to align sensibly our training dataset. We introduce a ...
متن کاملUsing Cognates in a French-Romanian Lexical Alignment System: A Comparative Study
This paper describes a hybrid French Romanian cognate identification module. This module is used by a lexical alignment system. Our cognate identification method uses lemmatized, tagged and sentence-aligned parallel corpora. This method combines statistical techniques, linguistic information (lemmas, POS tags) and orthographic adjustments. We evaluate our cognate identification module and we co...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2014